How factors such as gender and age affected the speed and performance of more than 15,000 marathon runners in 2022.
This project analyses trends associated with gender and age in the Dublin Marathon 2022. A diverse field of participants ran the marathon, which covers approximately 42.2 kilometers. The primary objectives are to examine how the field is evolving in terms of speed, age distribution, and gender representation. The analysis looks at gender differences and ageing stereotypes, exploring trends and anomalies within these categories. It also covers how the participants performed in terms of speed and how their early performance affected their results.
The Dublin Marathon is an annual marathon that is held in Dublin, Ireland. It is a point-to-point course that starts at Fitzwilliam Square in central Dublin and finishes at Phoenix Park. The marathon is a popular event with over 15,000 participants each year.
The data covers the primary factors of age, gender, the club a runner trained with, and the overall category. The main performance data consists of the time recorded at each checkpoint of the marathon, together with the runner's position at that checkpoint.
The data is filtered according to the type of analysis performed. The columns for social media content (YouTube, Photo and Share) were omitted because they were irrelevant to the analysis and contained only null values. The data was then separated into participants who finished the marathon and those who did not. Rows with missing or null values were removed for the analysis.
Several variables were stored with inappropriate types. The times were stored as timestamps, which were converted into seconds. The position variables were stored as character strings and were converted to numeric in R for analysis.
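The timestamp conversion described above can be sketched in a vectorised form. This is a minimal illustration, assuming times are formatted as `HH:MM:SS`; the helper name `hms_to_seconds` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: convert "HH:MM:SS" strings to seconds.
# Entries that do not split into exactly three parts (for example "") become NA.
hms_to_seconds <- function(times) {
  parts <- strsplit(times, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) != 3) return(NA_real_)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# Example values, not taken from the dataset
example_seconds <- hms_to_seconds(c("00:45:10", "03:45:30", ""))
example_seconds
```

Returning `NA` for malformed entries makes missing splits visible before any speed is computed, rather than failing partway through a column.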
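#Packages assumed from the functions used in this analysis
library(dplyr)
library(ggplot2)
library(cowplot)
library(plotly)
library(GGally)
library(viridis)
library(hrbrthemes)
library(caret)
M <- read.csv("dublin2022marathon.csv")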
#Deleting 3 features namely Photo, Youtube and Share as they are all null and have no influence on the race outcomes
M <- M[,!(names(M) %in% c("Photo", "YouTube", "Share"))]
#Replacing Missing Club values as No clubs
M <- M|> mutate(Club = coalesce(na_if(Club, ""), "No club"))
#Separating the ones who finished the race from those who did not finish
MF <- M[!(M$Stage.Position == '0'| M$Stage.Position == 'DNF' | M$Chip.Time ==''| M$Gun.Time ==''| M$Category ==''|M$Overall.Position =='DNF'| M$Stage.Position.1=='0'|M$Stage.Position.1=='DNF'|M$Stage.Position.2 == '0'| M$Stage.Position.2 == 'DNF' |M$Stage.Position.3 == '0'| M$Stage.Position.3 == 'DNF'| M$Stage.Position.4 == '0'| M$Stage.Position.4 == 'DNF'),]
#Function to convert the timestamps into seconds
convert_time <- function(t) {
tp <- as.numeric(strsplit(t, ":")[[1]])
return(tp[1] * 3600 + tp[2] * 60 + tp[3])
}
#Converting Timestamps to seconds
MF <- MF|> mutate(X10K= sapply(X10K,convert_time))
MF <- MF|> mutate(Gun.Time= sapply(Gun.Time,convert_time))
MF <- MF|> mutate(Chip.Time=sapply(Chip.Time,convert_time))
MF <- MF|> mutate(X20K= sapply(X20K,convert_time))
MF <- MF|> mutate(HALFWAY= sapply(HALFWAY,convert_time))
MF <- MF|> mutate(X30K= sapply(X30K,convert_time))
MF <- MF|> mutate(X40K= sapply(X40K,convert_time))
#converting the position data into numeric
MF$Category.Position <-as.numeric(MF$Category.Position)
MF$Gender.Position <-as.numeric(MF$Gender.Position)
MF$Stage.Position <-as.numeric(MF$Stage.Position)
MF$Overall.Position <- as.numeric(MF$Overall.Position)
MF$Chip.Position <- as.numeric(MF$Chip.Position)
MF$Stage.Position.1 <-as.numeric(MF$Stage.Position.1)
MF$Stage.Position.2 <-as.numeric(MF$Stage.Position.2)
MF$Stage.Position.3 <-as.numeric(MF$Stage.Position.3)
MF$Stage.Position.4 <-as.numeric(MF$Stage.Position.4)
#Renaming the column names
colnames(MF)[colnames(MF)=="X10K"]<- "Time_10Km"
colnames(MF)[colnames(MF)=="X20K"]<- "Time_20Km"
colnames(MF)[colnames(MF)=="HALFWAY"]<- "Halfway_21.3Km"
colnames(MF)[colnames(MF)=="X30K"]<- "Time_30Km"
colnames(MF)[colnames(MF)=="X40K"]<- "Time_40Km"
colnames(MF)[colnames(MF)=="Stage.Position"]<- "Position_10Km"
colnames(MF)[colnames(MF)=="Stage.Position.1"]<- "Position_20Km"
colnames(MF)[colnames(MF)=="Stage.Position.2"]<- "Position_25Km"
colnames(MF)[colnames(MF)=="Stage.Position.3"]<- "Position_30Km"
colnames(MF)[colnames(MF)=="Stage.Position.4"]<- "Position_40Km"
New variables such as finish status and speed at every milestone were then derived, creating 7 new variables.
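As an illustration of how segment speeds follow from cumulative split times, here is a minimal sketch for a single runner. The times are hypothetical, and the halfway checkpoint is assumed to sit at 21,100 m (half of the 42.2 km course).

```r
# Checkpoint distances in metres; the halfway mat is assumed to be at 21,100 m
checkpoint_m <- c(10000, 20000, 21100, 30000, 40000, 42200)
# Hypothetical cumulative times in seconds for one runner (not from the dataset)
cum_time_s <- c(3000, 6000, 6330, 9000, 12000, 12660)

# Distance and time covered in each segment
segment_m <- diff(c(0, checkpoint_m))
segment_s <- diff(c(0, cum_time_s))

# Speed in metres per second over each segment, and over the whole race
segment_speed <- segment_m / segment_s
avg_speed <- 42200 / 12660
segment_speed
```

Because this runner holds an even pace, every segment speed equals the overall average; a runner who fades would show lower values in the later segments.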
#Creating a new column for analyzing if the person improved their position from the first 10 Km
MF <- MF |> mutate(Improvement = if_else(Position_10Km>Overall.Position,"Yes","No"))
#Creating new columns for the average speed (m/s) over each segment (the race was 42.2Km long)
#Halfway of a 42.2Km course is about 21.1Km, so the segments around it are 1100 m and 8900 m (the column name keeps its original label)
MF <- MF |> mutate(Speed_10 = 10000/Time_10Km)
MF <- MF |> mutate(Speed_20 = 10000/(Time_20Km-Time_10Km))
MF <- MF |> mutate(Speed_Halfway =1100 /(Halfway_21.3Km-Time_20Km))
MF <- MF |> mutate(Speed_30 = 8900/(Time_30Km-Halfway_21.3Km))
MF <- MF |> mutate(Speed_40 = 10000/(Time_40Km-Time_30Km))
MF <- MF |> mutate(Speed_Finish = 2200/(Gun.Time-Time_40Km))
MF <- MF |> mutate(Avg_speed = 42200/Gun.Time)
#Analyze the different mean speed of each categories at each intervals
mean_spd_10<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_10))
mean_spd_20<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_20))
mean_spd_Halfway<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_Halfway))
mean_spd_30<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_30))
mean_spd_40<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_40))
mean_spd_Finish<- MF |> group_by(Category) |> summarize(Meanspd=mean(Speed_Finish))
SPEED<- MF |> group_by(Category) |> summarize(Meanspd=mean(Avg_speed))
MS<- MF[(MF$Category=='MS'),]
M40<- MF[(MF$Category=='M40'),]
M35<- MF[(MF$Category=='M35'),]
M45<- MF[(MF$Category=='M45'),]
FS<- MF[(MF$Category=='FS'),]
F35<- MF[(MF$Category=='F35'),]
M50<- MF[(MF$Category=='M50'),]
The gun time and chip time captured for each participant were compared. In a scatter plot of gun time against chip time the points lie close to the x = y line; we then fitted a linear regression model and observed a slight difference between gun time and chip time. Ideally the two would be the same, but in practice participants cross the start line some time after the gun, and that delay differs from person to person.
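The gap between the two clocks can also be inspected directly as a start delay. The following is a small sketch on hypothetical values; the column `Start_delay` is an illustrative name and is not part of the original dataset.

```r
# Hypothetical times in seconds for three runners (not taken from the dataset)
runners <- data.frame(
  Gun.Time  = c(9000, 12660, 15300),
  Chip.Time = c(8995, 12420, 14700)
)

# Start delay: time between the gun and the runner crossing the start line
runners$Start_delay <- runners$Gun.Time - runners$Chip.Time
summary(runners$Start_delay)
```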
#Just to check
CvG<-ggplot(MF,aes(Gun.Time,Chip.Time))+
geom_jitter(color="darkgreen")+
geom_smooth(method = "lm", se = FALSE, color = "red") + #linear fit of Chip.Time on Gun.Time
labs(title="Gun Time vs Chip Time",x="Guntime",y="Chiptime")
#CvG
The analysis is then conducted on the participants who finished the marathon, based on age, club membership, the speed captured at each interval of the marathon, and a regression model of how performance at the 10K and 20K marks influenced the result.
#Figure: Finished status of Female and Male Participants
ggplot(MF, aes(x = Gender, fill = Category)) +
geom_bar(stat = "count", position = "dodge") +
theme_minimal() +
labs(
x = 'Gender',
y = 'Number of finishers',
title = 'Age impact based on gender',
caption = 'Finished status of Female and Male Participants
based on Age category')
The graph shows the number of finishers in each age category, split by gender, for the Dublin Marathon 2022. More males than females finish the marathon in all categories. The difference in numbers is most pronounced in the 35-45 categories, where there are nearly twice as many males as females. The difference in finish rates is most pronounced in the 80+ category, where 70% of males finished the marathon, compared to 50% of females. There are a few possible explanations for these trends. One possibility is that males are more likely to finish the marathon than females because of differences in stamina and strength, although this data alone cannot confirm that.
#Separating those who completed the race into subsets according to Category (Age groups)
MS<- MF[(MF$Category=='MS'),]
M40<- MF[(MF$Category=='M40'),]
M35<- MF[(MF$Category=='M35'),]
M45<- MF[(MF$Category=='M45'),]
FS<- MF[(MF$Category=='FS'),]
F35<- MF[(MF$Category=='F35'),]
M50<- MF[(MF$Category=='M50'),]
#Might not be useful in the report
#boxplot(MS$Avg_speed)
#boxplot(M40$Avg_speed)
#boxplot(M35$Avg_speed)
#boxplot(M45$Avg_speed)
#boxplot(FS$Avg_speed)
#boxplot(F35$Avg_speed)
#boxplot(M50$Avg_speed)
#boxplot(MF$Avg_speed)
#Analyzing how many people outperformed in their category (only those whose average speed is above the 97.5th percentile of all finishers)
OP<-MF[((MF$Avg_speed>quantile(MF$Avg_speed,0.975))&(MF$Category=="F40"| MF$Category== "F45"| MF$Category=="M40"| MF$Category== "M45")),]
gg1<- ggplot(OP,aes(Category))+
geom_bar()+
labs(title="Outperformers in categories",x="Category",y="Number of people outperformed",caption = "Top performers in every category")+
theme_minimal()
# Pie chart
desired_genders <- c("Male", "Female")
desired_categories <- c("F40", "F45", "M40", "M45")
filtered_data <- MF[MF$Category %in% desired_categories & MF$Gender %in% desired_genders, ]
Hypothesis_data <- filtered_data %>% select(Gender,Category) %>%
mutate(Category = recode(Category,
"F40" = "U40", "M40" = "U40", "F45" = "O40", "M45" = "O40"))
gg2<- ggplot(Hypothesis_data, aes(x = "", fill = Category)) +
geom_bar(width = 1, stat = "count") +
coord_polar(theta = "y") +
labs(title = "Pie Chart of Category Distribution", caption="Over and Under 40 distribution")
plot_grid(gg1,gg2)
#Runners are split into two groups: the F40/M40 categories (labelled U40) and the F45/M45 categories (labelled O40)
tab <- table(Hypothesis_data)
#we can consider this as the analysis of two categorical(binary) variables
#Female or Male and U40 or O40
ptab <- prop.table(tab,2)
par(mfrow = c(1, 2))
#The people who finished are a sample of size n.
x <- c(1059,2031)
n <- c(1059+1034 ,2031+2182)
prop.test(x,n)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: x out of n
## X-squared = 3.0995, df = 1, p-value = 0.07832
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.00266457 0.05045059
## sample estimates:
## prop 1 prop 2
## 0.5059723 0.4820793
Using data from the most populous groups of participants who ran the Dublin Marathon, the test compares the proportions of two independent groups to determine whether they are significantly different. Each finisher is described by two binary variables: gender (male or female) and age group (under 45 or over 45). Here, we are interested in whether the age-group split among finishers differs between the two groups.
The sample proportions, 0.506 and 0.482, are close to each other, so neither age group clearly dominates among the finishers.
The p-value of 0.07832 is greater than the commonly used significance level of 0.05. Therefore, we do not have enough evidence to reject the null hypothesis, which is that there is no difference between the proportions in the two groups.
The 95% confidence interval provides a range of plausible values for the true difference in proportions. Since the interval (-0.0027, 0.0505) includes 0, the difference is not statistically significant. Note that this CI is approximate. These considerations highlight that speed and performance should also be taken into account when age groups are compared. The number of men and women aged 40-45 who took part in and finished the marathon is a notable reflection of the health trends and dedication involved.
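To make the reported interval easier to follow, the sketch below recomputes it by hand from the counts given in the text and compares it with `prop.test`. It assumes the default settings of `prop.test` (two-sided, with continuity correction); the variable names are illustrative.

```r
# Counts reported in the text above: successes and group totals
x <- c(1059, 2031)
n <- c(1059 + 1034, 2031 + 2182)

p_hat <- x / n                      # sample proportions
diff_hat <- p_hat[1] - p_hat[2]     # observed difference in proportions

# Unpooled standard error of the difference
se <- sqrt(sum(p_hat * (1 - p_hat) / n))
# Continuity correction added to the interval half-width
cc <- 0.5 * sum(1 / n)
ci <- diff_hat + c(-1, 1) * (qnorm(0.975) * se + cc)

res <- prop.test(x, n)
ci
res$conf.int
```

The hand-computed interval matches the one printed by `prop.test`, which shows that the interval is a normal approximation widened slightly by the continuity correction.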
# Filter data for top 150 male athletes based on better overall positions
top_male_runners <- MF %>%
filter(Gender == "Male") %>%
top_n(150, wt = -Overall.Position)
# Create a box plot for overall position and chip time of clubs for top 150 male runners
CGG1<-ggplot(top_male_runners, aes(y = Chip.Time, x = Club , fill = Club)) +
geom_boxplot() +
labs(title = "Box Plot of Chip Time by Club for Top 150 Male Runners",
x = "Club",
y = "Chip Time",fill = "Club",caption = "Box plot displaying the performance distribution of the top 150 male runners in the Dublin Marathon, showcasing diversity among the leading male athletes in the event") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "none") # Rotate the club names and hide the legend
ggplotly(CGG1)
Upon segmenting the dataset by gender, we observed intriguing patterns. Among male participants, those without club affiliations secured noteworthy initial positions, comprising a majority within the top 150 runners. However, an interesting trend emerged where club-affiliated male runners did not dominate the top positions initially.
For aspiring male runners aiming to participate and enhance their practice within the Dublin Marathon, consideration of joining prominent clubs such as CLONLIFFE HARRIERS A.C., ST. FINBARRS A.C., DONORE HARRIERS A.C., and PORTMARNOCK A.C. could significantly contribute to their training and performance improvements.
# Filter data for top 150 Female athletes based on better overall positions
top_female_runners <- MF %>%
filter(Gender == "Female") %>%
top_n(150, wt = -Overall.Position)
# Create a box plot for overall position and chip time of clubs for top 150 female runners
CGG2<-ggplot(top_female_runners, aes(y = Chip.Time, x = Club , fill = Club)) +
geom_boxplot() +
labs(title = "Box Plot of Chip Time by Club for Top 150 Female Runners",
x = "Club",
y = "Chip Time",
fill = "Club",caption ="Box plot displaying the performance distribution of the top 150 female runners in the Dublin Marathon, showcasing diversity among the leading female athletes in the event.") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "none")
ggplotly(CGG2)
Among female participants, those without club associations performed comparably better than their club-affiliated counterparts. Interestingly, female clubs exhibited more dominance within the top 150 positions than their male counterparts. Notably, our analysis indicated that females affiliated with clubs demonstrated better performance than males within club associations.
Female runners looking to augment their marathon preparation and performance might benefit from affiliations with clubs like CLANE A.C., CRUSADERS A.C., SPORTSWORLD A.C., and ST COCA’S A.C. Our analysis suggests that these clubs have exhibited a positive impact on the practice and overall performance of female runners.
mean_spd_data<- rbind(
transform(mean_spd_10, At = "10K"),
transform(mean_spd_20, At = "20K"),
transform(mean_spd_Halfway, At = "Halfway(21.3K)"),
transform(mean_spd_30, At = "30K"),
transform(mean_spd_40, At = "40K"),
transform(mean_spd_Finish, At = "Finish_dash"),
transform(SPEED, At = "Overall")
)
## Speed analysis
SGG1<-ggplot(mean_spd_data, aes(x = Category, y = Meanspd, fill = At)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Speed Comparison by Category", x = "Category", y = "Speed (m/s)",caption = "Average speed of participants in a category in splits in metre/second") +
theme_minimal()
SGG1
From the above graph, we can see the mean speed of the participants in each category at every checkpoint. In most categories participants start, on average, at a reasonable pace, speed up until the halfway mark and then gradually slow down as their stamina fades. There is not much difference in the average speed between checkpoints in most categories, which suggests that most participants try to maintain their speed throughout the race. We cannot say this holds for every participant, because averages over such a large dataset say little about the fastest and slowest runners.
MF_subset <- head(MF, 10)
#Top 10 finishers
SGG2<- ggparcoord(MF_subset, columns = c(25,26,27,28,29,30),
groupColumn = 4 , scale = "globalminmax",
showPoints=TRUE,
alphaLines = 0.3, mapping = ggplot2::aes(linewidth = 1),
title = "Speed of Top 10") +
#geom_abline(data = MF_subset,slope=0,intercept = MF_subset$Avg_speed,linetype= "dashed" ,alpha=0.5)+
ggplot2::scale_linewidth_identity() +
scale_color_viridis(discrete=TRUE) +
theme_ipsum()+
theme(legend.position="right",axis.text.x = element_text(angle = 90, hjust = 1),plot.title = element_text(size=13)) +
xlab("")
#Bottom 10 finishers
MF_subset_bot <- tail(MF,10)
SGG3<- ggparcoord(MF_subset_bot, columns = c(25,26,27,28,29,30),
groupColumn = 4 , scale = "globalminmax",
showPoints=TRUE,
alphaLines = 0.3, mapping = ggplot2::aes(linewidth = 1),
title = "Speed of Bottom 10") +
ggplot2::scale_linewidth_identity() +
scale_color_viridis(discrete=TRUE) +
theme_ipsum()+
theme(legend.position="right",axis.text.x = element_text(angle = 90, hjust = 1),plot.title = element_text(size=13)) +
xlab("")
plot_grid(SGG2,SGG3)
On further analysis of the top 10 and bottom 10 finishers, we see that the top 10 finishers follow the common trend of maintaining a steady pace throughout the race, while some of the bottom 10 finishers started fast (relative to their physical capabilities), tried to maintain that speed, but slowed down as the race went on.
The influence of performance and position at the 10 kilometer and 20 kilometer checkpoints can be modeled by the following equation.
\(\text{Chip Time}_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \epsilon_i\)
where \(\text{Chip Time}_i\) is the chip time in seconds for athlete \(i\), \(X_{1i}\) is the time in seconds taken by athlete \(i\) to complete 10 kilometers, \(X_{2i}\) is the time in seconds taken to complete 20 kilometers, \(X_{3i}\) is the athlete’s position at the 10 kilometer mark, \(X_{4i}\) is the athlete’s position at the 20 kilometer mark, and \(\epsilon_i\) is the error term, which is assumed to be independent and identically normally distributed. \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\) are the parameters to be estimated.
reg_data <- MF |>
select(Time_10Km, Time_20Km, Chip.Time, Position_10Km, Position_20Km)
reg_data <- reg_data[complete.cases(reg_data),]
y <- reg_data$Chip.Time
index <- createDataPartition(y, p = .8, list = F)
train_data <- reg_data[index,]
test_data <- reg_data[-index,]
model <- lm(Chip.Time ~ ., data = train_data)
correlation <- reg_data |> cor()
plot_ly(z = correlation,x = colnames(correlation),
y = rownames(correlation),
type = "heatmap") |>
layout(title = "Correlation Matrix Heatmap",
xaxis = list(categoryorder = "trace",
tickangle = -45,
tickfont = list(size = 13)),
yaxis = list(categoryorder = "trace",
tickangle = -45,
tickfont = list(size = 13)))
The above heat map shows the correlation matrix of the variables used in the model. All the variables are strongly positively correlated.
Before fitting the model, we split the data into a training set (80%) and a test set (20%). The table below shows the results of the regression model on the training set. The model has an \(R^2\) of about 0.94, implying that about 94% of the variation in Chip Time is explained by the predictors.
| | Dependent variable: Chip.Time |
| --- | --- |
| Time_10Km | -3.088*** (0.152) |
| Time_20Km | 3.458*** (0.070) |
| Position_10Km | -0.368*** (0.024) |
| Position_20Km | 0.479*** (0.023) |
| Constant | 1,545.953*** (98.055) |
| Observations | 11,429 |
| R2 | 0.936 |
| Adjusted R2 | 0.936 |
| Residual Std. Error | 770.168 (df = 11424) |
| F Statistic | 41,805.900*** (df = 4; 11424) |
| Note: | Standard errors in parentheses. *p<0.1; **p<0.05; ***p<0.01 |
Holding the other predictors fixed, a longer 20K time is associated, on average, with a slower chip time, while a longer 10K time is associated with a faster chip time. This suggests that the later part of the race (20K) has a stronger impact on overall performance than the early part (10K). Participants find it challenging to sustain an aggressive early pace over the entire marathon distance, a phenomenon called positive splitting.
The same pattern appears in the athlete’s position at 10 km and at 20 km: a worse position at the 10 km mark is associated with a lower chip time, whereas a worse position at the 20 km mark is associated with a slower chip time.
Having split the data into training and test sets, we evaluated the model on the test set.
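Because the 10K and 20K splits are strongly correlated, the sign of an individual coefficient should be read with care. The sketch below uses simulated data (not the marathon results) to show one way of checking this, by computing a variance inflation factor by hand; all variable names and values are assumptions made for illustration.

```r
set.seed(1)
n <- 500
# Simulated split times in seconds (hypothetical, not the marathon data)
t10 <- rnorm(n, mean = 3000, sd = 400)
t20 <- 2 * t10 + rnorm(n, mean = 0, sd = 100)
chip <- 2.1 * t20 + rnorm(n, mean = 0, sd = 300)

# Model with both correlated splits as predictors
fit <- lm(chip ~ t10 + t20)

# Variance inflation factor for t10: 1 / (1 - R^2),
# where R^2 comes from regressing t10 on the other predictor
r2 <- summary(lm(t10 ~ t20))$r.squared
vif_t10 <- 1 / (1 - r2)

cor(t10, t20)
vif_t10
```

A large variance inflation factor indicates that the two splits carry nearly the same information, so a coefficient describes the effect of one split only after the other has been accounted for.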
pred <- predict(model, newdata = test_data)
test_data$predictions <- pred
mse <- mean((test_data$Chip.Time - pred)^2)
rmse <- sqrt(mse)
# rmse
The model has a root mean square error of 768.9655, implying that its predictions typically deviate from the actual chip time by approximately 769 seconds (about 12.8 minutes).
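For reference, the error measure can be written as a small function and the reported value expressed in minutes. This is a minimal sketch: `rmse_fn` is a hypothetical helper name, and the first example uses made-up numbers.

```r
# Hypothetical helper: root mean square error of predictions
rmse_fn <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Tiny made-up example: only the last prediction is off, by 2 units
toy_rmse <- rmse_fn(c(1, 2, 3), c(1, 2, 5))

# The RMSE reported in the text, converted from seconds to minutes
rmse_minutes <- 768.9655 / 60
rmse_minutes
```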
p <- ggplot(data = test_data,
mapping = aes(x = Chip.Time, y = predictions)) +
geom_jitter() +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
geom_smooth(method = lm, se = F, color = "blue") +
labs(title = "Actual vs Predicted Chip Time",
x = "Actual Chip Time",
y = "Predicted Chip Time") +
theme(plot.title = element_text(hjust = .5))
ggplotly(p)
The above plot shows the actual chip time against the predicted chip time. It includes a point for each observation, a dashed reference line representing a perfect match (where actual equals predicted), and a regression line showing the relationship between actual and predicted values.
After examining and analyzing the data across different variables, we observed that factors such as age, gender, club membership, speed and performance in the first half influenced the results of the participants in the marathon. The hypothesis test challenged the stereotype that younger participants fare better in a marathon than those over 40, as the two groups had an almost identical distribution. We found that a larger share of female participants than male participants belonged to clubs. Females affiliated with clubs demonstrated better performance than males in clubs. We found that the participants who ran the race at a steady pace were more successful than those who did not.
In conclusion, the data indicates a significant connection between marathon performance and specific milestones. Participants with longer 20K times generally have slower overall chip times, emphasizing the importance of the later stages. The presence of positive splitting underscores the difficulty of maintaining an aggressive early pace. Notably, once the 20K split is accounted for, a poorer position at the 10K mark is associated with lower chip times, highlighting the role of early race positioning in overall performance.